A llamafile is just a single executable file that bundles the llama.cpp engine with model weights.
However, for a 30GB+ model like our Q8_0 GGUF, creating a 30GB executable is impractical. The real power-user workflow, which perfectly suits your goal, is to use the llamafile executable as a portable server and tell it to load an external GGUF file.
This gives you the best of all worlds:
- A single, portable llamafile executable that you can drop on any (Linux/macOS/Windows) machine.
- Your external, high-fidelity GGUF model, gemma-3-27b-it-q8_0.
- The ability to pass llama.cpp performance flags (--mlock, --threads, --n-gpu-layers) directly to it.
- The OpenAI-compatible API that tools like Continue.dev need.

Here is the complete walkthrough to create the ultimate, portable, high-performance coding experience.
### The llamafile Walkthrough

We need two things: the llamafile executable (the engine) and our high-fidelity model (the fuel).
Download the llamafile Executable:
Go to the llamafile GitHub releases page and download the main llamafile-0.8.6 (or newer) executable. We don't need a model bundled with it.
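For example (the version number and URL pattern are illustrative; check the releases page for the current build):

```sh
# Grab the bare llamafile binary (no bundled weights) and make it executable.
curl -L -o llamafile \
  https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.6/llamafile-0.8.6
chmod +x llamafile   # on Windows, rename it to llamafile.exe instead of running chmod
```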
You now have your portable engine.
Download the High-Fidelity GGUF Model:
This is the same as our previous step. We'll download the 30GB Q8_0 model and place it in a models folder.
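One way to do that, assuming you use huggingface-cli and substitute whichever GGUF repository you actually pull from (the repo id below is a placeholder):

```sh
# Download the Q8_0 GGUF into ./models (repo id is a placeholder).
mkdir -p models
huggingface-cli download <your-gguf-repo>/gemma-3-27b-it-GGUF \
  gemma-3-27b-it-q8_0.gguf --local-dir ./models
```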
This is the core of the setup. We will run our llamafile executable and pass it all the high-performance llama.cpp flags.
First, find your PHYSICAL core count (e.g., sysctl -n hw.physicalcpu on macOS or lscpu | grep "Core(s) per socket" on Linux). We'll use 8 cores as our example.
Here is the full launch command:
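A representative launch, assuming the GGUF was saved as ./models/gemma-3-27b-it-q8_0.gguf and that you measured 8 physical cores; treat the paths and numbers as starting points, not gospel:

```sh
# Sketch of a launch command:
#   --mlock          lock model weights in RAM so the OS can't page them out
#   --threads 8      match your PHYSICAL core count
#   --n-gpu-layers   layers to offload to the GPU (999 = as many as fit, 0 = CPU-only)
./llamafile \
  -m ./models/gemma-3-27b-it-q8_0.gguf \
  --mlock \
  --threads 8 \
  --n-gpu-layers 999 \
  --host 127.0.0.1 \
  --port 8080
```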
Your terminal will now show server logs. You have a high-performance, OpenAI-compatible API running at http://127.0.0.1:8080.
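If you want to sanity-check the server before touching the editor, a quick curl against the OpenAI-compatible chat endpoint should come back with a completion (the model field is largely cosmetic when only one model is loaded):

```sh
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-3-27b-it-q8_0",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```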
This is what makes the llamafile server so powerful. You don't need to build a new model. You can "hot-swap" a LoRA by just adding a flag at launch.
Let's assume you've downloaded a rust-code-lora.gguf into your ./models folder.
You would simply add one flag to the launch command:
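Something like this, assuming your llamafile build exposes llama.cpp's --lora option for GGUF adapters (the adapter path is illustrative):

```sh
# Same launch as before, plus one flag to apply the LoRA adapter at load time.
./llamafile \
  -m ./models/gemma-3-27b-it-q8_0.gguf \
  --lora ./models/rust-code-lora.gguf \
  --mlock \
  --threads 8 \
  --n-gpu-layers 999 \
  --host 127.0.0.1 \
  --port 8080
```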
Now the server running at http://127.0.0.1:8080 is serving your gemma-q8 model with the Rust LoRA applied, already specialized for Rust. You can keep multiple launch scripts to start different "specialist" servers.
### Continue.dev for the Best Code Experience

This is the final step. We will point Continue.dev at our new, high-performance llamafile server.
Continue.dev can connect to any OpenAI-compatible API.
Open VS Code and go to your ~/.continue/config.yaml file.
Paste in this configuration:
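A sketch of what that file might contain. The llamafile server is wired in through Continue's generic OpenAI-compatible provider; the field names follow Continue's YAML config format as I understand it, so double-check them against the current docs:

```yaml
# ~/.continue/config.yaml (sketch; adjust names and roles to your Continue version)
name: Local llamafile Assistant
version: 1.0.0
schema: v1
models:
  - name: gemma-q8 (llamafile)
    provider: openai                      # generic OpenAI-compatible endpoint
    model: gemma-3-27b-it-q8_0
    apiBase: http://127.0.0.1:8080/v1
    apiKey: none                          # llamafile doesn't check the key
    roles:
      - chat
      - edit
      - apply
  - name: mxbai embeddings (Ollama)
    provider: ollama
    model: mxbai-embed-large
    roles:
      - embed
```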
(Note: This setup still uses your Ollama server for embeddings, as it's the simplest way to manage the mxbai-embed-large model. Your llamafile server will handle all the generation.)
Reload VS Code.
You are now 100% operational. When you type @codebase in Continue.dev:
1. Continue.dev uses your Ollama mxbai-embed-large model to index your code.
2. The relevant code chunks and your prompt are sent to the llamafile server at http://127.0.0.1:8080.
3. The llamafile server, running with locked RAM and full GPU/CPU acceleration, generates the code response using the high-fidelity gemma-q8 model (with the Rust LoRA, if you added it).

You have successfully combined the raw power of a natively-run llama.cpp engine with the simplicity of a llamafile server and the deep integration of Continue.dev.